Modelling Patterns for Deep Web Wrapper Generation
Abstract
Interfaces of web information systems are highly heterogeneous. In addition to schema heterogeneity, they differ at the presentation layer. Web interface wrappers need to understand these interfaces in order to enable interoperation among web information systems. In contrast to the general scenario, it has been observed that within an application domain (e.g. air travel) heterogeneity is limited. More specifically, web interfaces in a domain share a limited common vocabulary and use a small set of layout variants. We therefore propose the existence of web interface patterns, which are characterized by these two aspects: the vocabulary used on the one hand and the common layout of pages on the other. These patterns can be derived from a domain model that is structured into an ontological model and a layout model. The paper introduces metamodels for ontological and layout models and describes a model-driven approach to generating patterns from a sample set of web interfaces. We use a clustering algorithm to identify correspondences between model instances. This pattern approach allows wrappers to be generated for the deep web sources of a specific domain.
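The abstract does not say which clustering algorithm is used or how model instances are represented. As a rough illustration only, the sketch below groups field labels collected from several interfaces of one domain by token overlap. The Jaccard similarity measure, the single-link merging strategy, the 0.6 threshold, the function names and the sample labels are all assumptions made for this example, not details taken from the paper.

```python
from itertools import combinations


def tokens(label):
    """Lower-case a field label and split it into a set of word tokens."""
    return frozenset(label.lower().replace("-", " ").split())


def jaccard(a, b):
    """Token-set Jaccard similarity between two labels (assumed measure)."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def cluster_labels(labels, threshold=0.6):
    """Single-link clustering: labels whose similarity reaches the
    threshold end up in the same cluster (hypothetical stand-in for the
    paper's unspecified clustering algorithm)."""
    labels = list(dict.fromkeys(labels))  # drop duplicates, keep order
    parent = {label: label for label in labels}

    def find(x):
        # Follow parent links to the cluster representative.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(labels, 2):
        if jaccard(a, b) >= threshold:
            parent[find(a)] = find(b)

    clusters = {}
    for label in labels:
        clusters.setdefault(find(label), []).append(label)
    return sorted(sorted(group) for group in clusters.values())


# Made-up labels, as they might appear on different air travel search forms.
sample = ["Departure date", "Date of departure", "Return date",
          "Date of return", "Passengers"]
clusters = cluster_labels(sample)
```

With these sample labels, the two departure-date labels form one cluster, the two return-date labels form another, and "Passengers" stays on its own. Each cluster can be read as a candidate correspondence between fields of different interfaces. A lower threshold would merge "Date of departure" and "Date of return", which is why the cut-off has to be tuned per domain.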
Similar resources
Data Extraction using Content-Based Handles
In this paper, we present an approach and a visual tool, called HWrap (Handle Based Wrapper), for creating web wrappers to extract data records from web pages. In our approach, we mainly rely on the visible page content to identify data regions on a web page. Our extraction algorithm is inspired by the way a human user scans the page content for specific data. In particular, we use text fea...
Site-Wide Wrapper Induction for Life Science Deep Web Databases
We present a novel approach to automatic information extraction from Deep Web Life Science databases using wrapper induction. Traditional wrapper induction techniques focus on learning wrappers based on examples from one class of Web pages, i.e. from Web pages that are all similar in structure and content. Thereby, traditional wrapper induction targets the understanding of Web pages generated f...
Automatic Generation of Deep Web Wrappers based on Discovery of Repetition
A Deep Web wrapper is a program that extracts contents from search results. We propose a new automatic wrapper generation algorithm that discovers a repetitive pattern in search results. The repetitive pattern is expressed by token sequences that consist of HTML tags, plain text and wildcards. The algorithm applies string matching with mismatches to unify the variation from the template...
Automatic Wrapper Generation and Maintenance
This paper investigates automatic wrapper generation and maintenance for forum, blog and news web sites. Web pages are increasingly generated dynamically from a common template populated with data from databases. This paper proposes a novel method that uses tree alignment and transfer learning to generate the wrapper from this kind of web page. The tree alignment algorithm is adopted...
Web-Prospector - An Automatic, Site-Wide Wrapper Induction Approach for Scientific Deep-Web Databases
Wrapper induction techniques traditionally focus on learning wrappers based on examples from one class of Web pages, i.e. from Web pages that are all similar in structure and content. Thereby, traditional wrapper induction targets the understanding of Web pages generated from a database using the same generation template as observed in the example set. Applying such techniques to Web sites gene...
Publication date: 2007